54 research outputs found

    Increasing the degree of parallelism using speculative execution in task-based runtime systems

    Task-based programming models have demonstrated their efficiency in the development of scientific applications on modern high-performance platforms. They allow delegation of the management of parallelization to the runtime system (RS), which is in charge of the data coherency, the scheduling, and the assignment of the work to the computational units. However, some applications have a limited degree of parallelism such that, no matter how efficient the RS implementation, they may not scale on modern multicore CPUs. In this paper, we propose using speculation to unleash the parallelism when it is uncertain whether some tasks will modify data, and we formalize a new methodology to enable speculative execution in a graph of tasks. This description is partially implemented in our new C++ RS called SPETABARU, which is capable of executing tasks in advance if some others are not certain to modify the data. We study the behavior of our approach to compute Monte Carlo and replica exchange Monte Carlo simulations.

    TBFMM: A C++ generic and parallel fast multipole method library

    TBFMM, for task-based FMM, is a high-performance package that implements the parallel fast multipole method (FMM) in modern C++17. It implements parallel strategies for multicore architectures, i.e. to run on a single computing node. TBFMM was designed to be easily customized thanks to C++ templates and fine control of the C++ classes inter-dependencies. Users can implement new FMM kernels, new types of interacting elements or even new parallelization strategies. As such, it can effectively be used as a simulation toolbox for scientists in physics or applied mathematics. It enables users to perform simulations while delegating the data structure, the algorithm and the parallelization to the library. Besides, TBFMM can also provide an interesting use case for the HPC research community regarding parallelization, optimization and scheduling of applications handling irregular data structures.

    Impact study of data locality on task-based applications through the Heteroprio scheduler

    The task-based approach has emerged as a viable way to effectively use modern heterogeneous computing nodes. It allows the development of parallel applications with an abstraction of the hardware by delegating task distribution and load balancing to a dynamic scheduler. In this organization, the scheduler is the most critical component that solves the DAG scheduling problem in order to select the right processing unit for the computation of each task. In this work, we extend our Heteroprio scheduler that was originally created to execute the fast multipole method on multi-GPU nodes. We improve Heteroprio by taking into account data locality during task distribution. The main principle is to use different task-lists for the different memory nodes and to investigate how locality affinity between the tasks and the different memory nodes can be evaluated without looking at the tasks' dependencies. We evaluate the benefit of our method on two linear algebra applications and a stencil code. We show that simple heuristics can provide significant performance improvement and cut the total memory transfer of an execution by more than half.
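
    The idea of one task-list per memory node, with an affinity evaluated from data placement alone, can be sketched as follows. This is only an illustrative sketch and not necessarily the heuristic used in the paper: the score (bytes of a task's data already resident on a node) and all names (`locality_affinity`, `push_task`, `resident`, the node labels) are assumptions made for the example.

```python
from collections import defaultdict, deque

def locality_affinity(task_data, resident, node):
    """Bytes of the task's data that are already resident on `node`.

    task_data: dict mapping data handle -> size in bytes (hypothetical layout).
    resident:  dict mapping data handle -> set of nodes holding a valid copy.
    """
    return sum(size for handle, size in task_data.items()
               if node in resident.get(handle, ()))

def push_task(task_name, task_data, resident, nodes, task_lists):
    """Append the task to the list of the memory node with the highest affinity.

    Only data placement is inspected; task dependencies are never consulted.
    Ties go to the first node in `nodes` (max returns the first maximum).
    """
    best = max(nodes, key=lambda n: locality_affinity(task_data, resident, n))
    task_lists[best].append(task_name)
    return best

# Assumed example: three memory nodes and three data handles.
nodes = ["cpu_ram", "gpu0", "gpu1"]
resident = {"A": {"cpu_ram", "gpu0"}, "B": {"gpu0"}, "C": {"gpu1"}}
task_lists = defaultdict(deque)

chosen = push_task("gemm", {"A": 4096, "B": 4096, "C": 1024},
                   resident, nodes, task_lists)
print(chosen)
```

    In this example the task lands in the list of the node that already holds most of its data, so a worker attached to that node would need to fetch the fewest bytes before running it.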

    SPETABARU: A Task-based Runtime System with Speculative Execution Capability

    While task-based programming models allow expressing the parallelism of algorithms finely, the traditional data accesses used in the sequential task-flow model (STF) can restrict the parallelism and hide useful information. In this presentation, we describe how more precise data accesses can be used to get better performance, and how uncertain modifications of the data by the tasks open the possibility for speculative execution. We detail different speculative execution models when this uncertainty exists. We also introduce our speculative runtime system, SPETABARU, and provide examples with the parallelization of the Monte Carlo and replica exchange Monte Carlo simulations.

    Time-Domain BEM for the Wave Equation on Distributed-Heterogenous Architectures : a Blocking Approach

    The problem of time-domain BEM for the wave equation in acoustics and electromagnetism can be expressed as a sparse linear system composed of multiple interaction/convolution matrices. It can be solved using sparse matrix-vector products, which are ill-suited to achieving a high Flop-rate on either CPU or GPU. In this paper we extend the approach proposed in a previous work \cite{bib:bramas}, in which we re-order the computation to get a special matrix structure with one dense vector per row. This new structure is called a slice matrix and is computed with a custom matrix/vector product operator. In this study we present optimized implementations of this operator on Nvidia GPUs based on two blocking strategies. We explain how we can obtain multiple block-values from a slice and how these can be computed efficiently on GPU. We target heterogeneous nodes composed of CPU and GPU. In order to deal with the different efficiencies of the processing units, we use a greedy heuristic that dynamically balances the work among the workers. We demonstrate the performance of our system by studying the quality of the balancing heuristic and the sequential Flop-rate of the blocked implementations. Finally, we validate our implementation with an industrial test case on 8 heterogeneous nodes each composed of 12 CPU and 3 GPU.

    An Efficient Particle Tracking Algorithm for Large-Scale Parallel Pseudo-Spectral Simulations of Turbulence

    Particle tracking in large-scale numerical simulations of turbulent flows presents one of the major bottlenecks in parallel performance and scaling efficiency. Here, we describe a particle tracking algorithm for large-scale parallel pseudo-spectral simulations of turbulence which scales well up to billions of tracer particles on modern high-performance computing architectures. We summarize the standard parallel methods used to solve the fluid equations in our hybrid MPI/OpenMP implementation. As the main focus, we describe the implementation of the particle tracking algorithm and document its computational performance. To address the extensive inter-process communication required by particle tracking, we introduce a task-based approach to overlap point-to-point communications with computations, thereby enabling improved resource utilization. We characterize the computational cost as a function of the number of particles tracked and compare it with the flow field computation, showing that the cost of particle tracking is very small for typical applications.

    Automated prioritizing heuristics for parallel task graph scheduling in heterogeneous computing

    High-performance computing (HPC) relies increasingly on heterogeneous hardware and especially on the combination of central and graphical processing units. The task-based method has demonstrated promising potential for parallelizing applications on such computing nodes. With this approach, the scheduling strategy becomes a critical layer that describes where and when the ready-tasks should be executed among the processing units. In this study, we describe a heuristic-based approach that assigns priorities to each task type. We rely on a fitness score for each task/worker combination for generating priorities and use these for configuring the Heteroprio scheduler automatically within the StarPU runtime system. We evaluate our method’s theoretical performance on emulated executions and its real-case performance on multiple different HPC applications. We show that our approach is usually equivalent to or faster than expert-defined priorities.
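
    To make the notion of deriving per-worker priorities from a task/worker fitness score concrete, here is a minimal sketch. The fitness used below (time on the best alternative worker divided by time on this worker) is an assumption chosen for illustration, not the score defined in the paper, and `priorities`, `exec_time` and the timings are hypothetical; no StarPU API is used.

```python
def priorities(exec_time):
    """Return, for each worker type, the task types ordered by decreasing fitness.

    exec_time: dict task_type -> dict worker_type -> execution time (seconds).
    Assumed fitness of (task, worker): time on the best other worker divided
    by time on this worker, i.e. how much this worker gains over alternatives.
    """
    workers = sorted({w for times in exec_time.values() for w in times})
    result = {}
    for w in workers:
        def fitness(task_type):
            times = exec_time[task_type]
            others = [t for other, t in times.items() if other != w]
            # A task type that only this worker can run gets top fitness.
            return min(others) / times[w] if others else float("inf")

        # Only task types that have an implementation for this worker.
        runnable = [t for t in exec_time if w in exec_time[t]]
        result[w] = sorted(runnable, key=fitness, reverse=True)
    return result

# Hypothetical timings for three task types on two worker types.
exec_time = {
    "gemm":  {"cpu": 8.0, "gpu": 1.0},
    "potrf": {"cpu": 2.0, "gpu": 4.0},
    "trsm":  {"cpu": 3.0, "gpu": 1.5},
}
prio = priorities(exec_time)
print(prio)
```

    With these numbers each worker type first picks the task types it accelerates most relative to the alternative, which is the kind of ordering an expert would otherwise set by hand.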

    Optimization of a discontinuous Galerkin solver with OpenCL and StarPU

    With recent advances in microprocessor design, the optimization of computing software has become increasingly technical. One of the difficulties is to transform sequential algorithms into parallel ones. A possible solution is the task-based design. In this approach, it is possible to describe the parallelization possibilities of the algorithm automatically. The task-based design is also a good strategy to optimize software in an incremental way. The objective of this paper is to describe a practical experience of a task-based parallelization of a Discontinuous Galerkin method in the context of electromagnetic simulations. The task-based description is managed by the StarPU runtime. Additional acceleration is obtained by OpenCL.

    MulTreePrio: Scheduling task-based applications for heterogeneous computing systems

    Effective scheduling is crucial for task-based applications to achieve high performance in heterogeneous computing systems. These applications are usually represented by directed acyclic graphs (DAG). In this paper, we present a dynamic scheduling technique for DAGs intending to minimize the overall completion time of the parallelized applications. We introduce MulTreePrio, a novel scheduler based on a set of balanced tree data structures. The assignment of tasks to available resources is done according to priority scores per task for each type of processing unit. These scores are computed through heuristics built according to a set of rules that our scheduler should fulfil. We simulate the scheduling on three DAGs coming from numerical kernels with different configurations and we compare its behavior with both dynamic schedulers and static scheduling techniques based on the critical path. We show the efficiency of our scheduler with an average speedup of 2x with respect to the dynamic scheduler and 0.99x compared to the critical path-based scheduler. MulTreePrio is promising and in future work, it will be integrated into a task-based runtime system and tested in real-life scenarios.